1 Getting started

DEbChIP is an R package distributed as part of the Bioconductor project and CRAN. To install the package, start R and enter:

# install via Bioconductor
if (!requireNamespace("BiocManager", quietly=TRUE))
  install.packages("BiocManager")
BiocManager::install("DEbChIP")
# install via Github
# install.package("remotes")   #In case you have not installed it.
remotes::install_github("showteeth/DEbChIP")

Once DEbChIP is installed, it can be loaded by the following command.

library("DEbChIP")

2 Introduction

The goal of DEbChIP is to assist in the processing of RNA-seq data. It contains six functional modules:

  • Quality Control (QC): QC on count matrix and samples.
    • QC on count matrix: Proportion of genes detected in different samples under different CPM thresholds and the saturation of the number of genes detected.
    • QC on samples: Euclidean distance and pearson correlation coefficient of samples across different conditions, sample similarity on selected principal components (check batch informatio and conduct batch correction) and outlier detection with robust PCA.
  • Principal Component Analysis (PCA): this module can be divided into three sub modules, basic info, loading related and 3D visualization.
    • basic info: screen plot (help to select the useful PCs), biplot (sample similarity with corresponding loading genes) and PC pairs plot (sample similarity under different PC combinations).
    • loading related: visualize positive and negative loading genes on selected PCs, conduct GO enrichment analysis on selected loading genes of selected PC.
    • 3D visualization: visualize samples on three selected PCs.
  • Differential Expression Gene Visualization: this module includes six powerful visualization methods (Volcano Plot, Scatter Plot, MA Plot, Rank Plot, Gene Plot, Heatmap).
  • Functional Enrichment Analysis (FEA): GO enrichment analysis, KEGG enrichment analysis, Gene Set Enrichment Analysis (GSEA).
    • GO (Biological Process, Molecular Function, Cellular Component) and KEGG on differential expression genes
    • GSEA on all genes
  • Integrate with ChIP-seq:
    • Get consensus peaks: For multiple peak files, get consensus peaks; for single peak file, use it directly
    • Peak profile plots: Heatmap of ChIP binding to TSS regions, Average Profile of ChIP peaks binding to TSS region, Profile of ChIP peaks binding to different regions.
    • Peak annotaion
    • Integrate with RNA-seq: Integrate with RNA-seq to find direct targets, including up-regulated and down-regulated.
    • Integrate summary: Summary the integrated results, get the overlap number of up-regulated genes and ChIP-seq results (UP), down-regulated genes and ChIP-seq results.
    • GO enrichment on integrated results: GO enrichment on up-regulated targets and down-regulated targets.
  • Utils: useful functions when dealing with RNA-seq data, including gene name conversion and count normalization(DESeq2’s median of ratios, TMM, CPM, TPM, RPKM).

To enhance the ease of use of the tool, we have also developed an interactive tool for DEbChIP that allows users to submit files to the web page and set parameters to get the desired results. Unlike the standalone R package, the interactive web page has built-in DESeq2 for differential expression analysis, while the R package can accept user input results from DESeq2 or edgeR, which will be more flexible.

By the way, all plots generated are publication-ready, and most of them are based on ggplot2, so that users can easily modify them according to their needs.


3 Load the data

The data used here are RNA-seq data of the external granule layer in the cerebellum of control and conditional SnoN knockout mice, the raw data are stored in GSE120279, they are used in Robust principal component analysis for accurate outlier sample detection in RNA-seq data.

First, we create DESeqDataSet object with above dataset:

suppressWarnings(suppressMessages(library(DESeq2)))
suppressWarnings(suppressMessages(library(DEbChIP)))
count.file <- system.file("extdata", "snon_count.txt", package = "DEbChIP")
meta.file <- system.file("extdata", "snon_meta.txt", package = "DEbChIP")
count.matrix <- read.table(file = count.file, header = T, sep = "\t")
head(count.matrix)
##                        KO1  KO2  KO3  KO4  KO5  KO6  WT1  WT2  WT3  WT4  WT5
## ENSMUSG00000000001.4  4556 4218 3835 3718 5741 3875 4115 5074 2931 4374 5118
## ENSMUSG00000000003.15    0    0    0    0    0    0    0    0    0    0    0
## ENSMUSG00000000028.14  350  579  435  316  432  317  245  621  419  506  545
## ENSMUSG00000000031.16  268  804   66  207   46   66  336  112   60   69  137
## ENSMUSG00000000037.16  262  157  184  162  301  233  311  176   94  139  229
## ENSMUSG00000000049.11    0    0    0    0    0    0    1    0    2    0    1
##                        WT6
## ENSMUSG00000000001.4  3625
## ENSMUSG00000000003.15    0
## ENSMUSG00000000028.14  371
## ENSMUSG00000000031.16  193
## ENSMUSG00000000037.16  186
## ENSMUSG00000000049.11    0
meta.info <- read.table(file = meta.file, header = T)
head(meta.info)
##     condition
## KO1        KO
## KO2        KO
## KO3        KO
## KO4        KO
## KO5        KO
## KO6        KO
dds <- DESeq2::DESeqDataSetFromMatrix(countData = count.matrix, colData = meta.info, design = ~condition)
dds
## class: DESeqDataSet 
## dim: 53379 12 
## metadata(1): version
## assays(1): counts
## rownames(53379): ENSMUSG00000000001.4 ENSMUSG00000000003.15 ...
##   ENSMUSG00000115849.1 ENSMUSG00000115850.1
## rowData names(0):
## colnames(12): KO1 KO2 ... WT5 WT6
## colData names(1): condition

4 Quality Control

Quality control are of vital importance to get realiable results. Here, quality control includes two main aspects: sequencing saturation (QC on count matrix) and sample similarity (QC on samples).

4.1 QC on count matrix

4.1.1 CPM threashold

This plot shows proportion of genes detected in different samples at different CPM thresholds. Theoretically, all samples should be similarly distributed, so as to avoid false positive results obtained from detection problems.

CountQC(deobj = dds, group.key = "condition", type = "cpm")
## Differential expression analysis with DESeq2!
## [1] "Warning: 25096 features with 0 counts in all samples are to be removed for this analysis."
## [1] "Count distributions are to be computed for:"
##  [1] "KO1" "KO2" "KO3" "KO4" "KO5" "KO6" "WT1" "WT2" "WT3" "WT4" "WT5" "WT6"


4.1.2 sequencing saturation

This plot shows the proportion of genes detected at different sequencing depths. If the proportion is too low, it indicates that many genes are not detected and therefore some important information may be missing, and increasing the sequencing volume can effectively solve this problem.

CountQC(deobj = dds, group.key = "condition", type = "saturation")
## Differential expression analysis with DESeq2!


4.2 QC on samples

QC on samples including four aspects:

  • Euclidean distance and pearson correlation coefficient: sample similarity on the entire transcriptome.
  • PCA: sample similarity on selected principal components.
  • outlier detection with robust PCA
  • batch correction

4.2.1 Euclidean distance and pearson correlation coefficient

SampleRelation(deobj = dds,transform.method = "rlog",anno.key = "condition")
## Differential expression analysis with DESeq2!


4.2.2 PCA

qc.pca.res=PCA(deobj = dds,transform.method = "rlog")
## Differential expression analysis with DESeq2!
## Use all genes for PCA!
PCAtools::biplot(qc.pca.res,x="PC1",y="PC2",colby="condition",legendPosition="bottom")


4.2.3 outlier detection with robust PCA

outlier.res=OutlierDetection(deobj = dds,var.genes = NULL,transform.method = "rlog")
## Differential expression analysis with DESeq2!
## Use all genes for PCA!
## Detecting 2 outlier(s): KO3,WT1
outlier.res$outlier
## [1] "KO3" "WT1"
outlier.res$plot

For batch correction and outlier detection, you can use the following codes:

# batch effect correction
# this is a demo scripts
batch.res=QCPCA(deobj = dds,var.genes = NULL,remove.sample=NULL,transform.method = "rlog",batch = "cell",
                colby = "dex", outlier.detection = F)
batch.res$plot
# outlier detection
outlier.res=QCPCA(deobj = dds,var.genes = NULL,remove.sample=NULL,transform.method = "rlog",
                  outlier.detection = T,rpca.method = "PcaGrid")
outlier.res$plot
# the final dds
dds=outlier.res$deobj # or outlier.res$plot

5 Principal Component Analysis

5.1 basic info

# conduct PCA
pca.res=PCA(deobj = dds,remove.sample = outlier.res$outlier,transform.method = "rlog")
## Differential expression analysis with DESeq2!
## Use all genes for PCA!
# get basic plots
basic.plots=PCABasic(pca.res,colby="condition",legend.pos = "right")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

5.1.1 screen plot

basic.plots[["screen"]]

5.1.2 biplot

basic.plots[["biplot"]]

5.1.3 PC pairs plot

basic.plots[["pairs"]]


5.3 3D visualization

PCA3D(pca = pca.res,color.key = "condition",main = "3D PCA")


6 Differential Expression Gene Visualization

6.1 Differential Expression Analysis

First you should get differential expression genes. For DESeq2 results:

# set control level
dds$condition <- relevel(dds$condition, ref = "WT")
# conduct differential expressed genes analysis
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# extract results
dds.results <- results(dds,contrast=c("condition",'KO','WT'))
dds.results.ordered <- dds.results[order(dds.results$log2FoldChange,decreasing = TRUE),]
head(dds.results.ordered)
## log2 fold change (MLE): condition KO vs WT 
## Wald test p-value: condition KO vs WT 
## DataFrame with 6 rows and 6 columns
##                        baseMean log2FoldChange     lfcSE      stat      pvalue
##                       <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSMUSG00000025934.15   2.01093        4.46030   3.01742   1.47818          NA
## ENSMUSG00000105597.1    1.64376        4.16931   2.98872   1.39502 0.163011113
## ENSMUSG00000109866.1    2.32064        4.14573   1.22504   3.38415 0.000713987
## ENSMUSG00000085819.7    2.23441        4.07496   1.65939   2.45569 0.014061354
## ENSMUSG00000102474.1    1.52867        4.06781   1.80781   2.25013 0.024440396
## ENSMUSG00000042677.7    1.49781        4.05394   1.71350   2.36588 0.017987495
##                            padj
##                       <numeric>
## ENSMUSG00000025934.15        NA
## ENSMUSG00000105597.1    0.99973
## ENSMUSG00000109866.1    0.99973
## ENSMUSG00000085819.7    0.99973
## ENSMUSG00000102474.1    0.99973
## ENSMUSG00000042677.7    0.99973

For edgeR results:

library(edgeR)
## Loading required package: limma
## 
## Attaching package: 'limma'
## The following object is masked from 'package:DESeq2':
## 
##     plotMA
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
snon.edgeR=DGEList(counts=count.matrix, group=meta.info$condition)
keep <- filterByExpr(snon.edgeR,min.count=10)
snon.edgeR <- snon.edgeR[keep, , keep.lib.sizes=FALSE]
snon.edgeR <- calcNormFactors(snon.edgeR)
snon.edgeR$samples$group <- relevel(snon.edgeR$samples$group, ref="WT")
design <- model.matrix(~snon.edgeR$samples$group)
snon.edgeR <- estimateDisp(snon.edgeR, design)
fit <- glmQLFit(snon.edgeR, design)
qlf <- glmQLFTest(fit, coef=2)
all.res <- topTags(qlf,n=nrow(snon.edgeR$counts))$table
head(all.res)
##                            logFC    logCPM        F       PValue         FDR
## ENSMUSG00000039620.17  2.1925630 1.7076072 70.78781 2.686429e-07 0.003577786
## ENSMUSG00000046152.16 -0.7238661 4.9567684 36.89615 1.550592e-05 0.103253915
## ENSMUSG00000022376.8  -1.7356149 0.3243428 21.61086 2.617267e-04 0.998523400
## ENSMUSG00000027660.16 -0.9581197 5.3679317 19.98078 3.791045e-04 0.998523400
## ENSMUSG00000031478.16  0.8734604 2.4708238 18.99645 4.783255e-04 0.998523400
## ENSMUSG00000079469.10 -1.2129059 1.4296852 18.36878 5.568020e-04 0.998523400

6.2 Visualization

6.2.1 VolcanoPlot

# VolcanoPlot for DESeq2
VolcanoPlot(dds.results.ordered,signif="pvalue",l2fc.threashold=0.3,label.num=2,
            point.alpha = 0.8, label.color=c("purple","green"),tick.trans = NULL)
## Differential expression analysis with DESeq2!

# VolcanoPlot for edgeR
VolcanoPlot(all.res,signif="PValue",l2fc.threashold=0.3,label.num=2,point.alpha = 0.8,
            label.color=c("purple","green"),tick.trans = NULL)
## Differential expression analysis with edgeR!

6.2.2 ScatterPlot

# ScatterPlot for DESeq2
ScatterPlot(deobj = dds,deres = dds.results.ordered,group.key = "condition",
            ref.group = "WT",signif="pvalue",l2fc.threashold=0.3,label.num = 2,
            point.alpha = 0.8,label.color=c("purple","green"))
## Differential expression analysis with DESeq2!
## Loading required package: ggplot2

# ScatterPlot for edgeR
ScatterPlot(deobj = snon.edgeR,deres = all.res,group.key = "condition",
            ref.group = "WT",signif="PValue",l2fc.threashold=0.3,label.num = 2,
            point.alpha = 0.8,label.color=c("purple","green"))
## Differential expression analysis with edgeR!

6.2.3 MAPlot

# MAPlot for DESeq2
MAPlot(dds.results.ordered,signif="pvalue",l2fc.threashold=0.3,label.num=2,
       point.alpha = 0.8, label.color=c("purple","green"))
## Differential expression analysis with DESeq2!

# MAPlot for edgeR
MAPlot(all.res,signif="PValue",l2fc.threashold=0.3,label.num=2,point.alpha = 0.8,
       label.color=c("purple","green"))
## Differential expression analysis with edgeR!

6.2.4 RankPlot

# RankPlot for DESeq2
RankPlot(dds.results.ordered,signif="pvalue",l2fc.threashold=0.3,label.num=2,
         point.alpha = 0.8, label.color=c("purple","green"))
## Differential expression analysis with DESeq2!

# RankPlot for edgeR
RankPlot(all.res,signif="PValue",l2fc.threashold=0.3,label.num=2,point.alpha = 0.8,
         label.color=c("purple","green"))
## Differential expression analysis with edgeR!

6.2.5 GenePlot

# GenePlot for DESeq2
GenePlot(deobj = dds,deres = dds.results.ordered,group.key = "condition",
         ref.group = "WT",fill.color=c("red","blue"), fill.alpha = 0.8,
         gene.num =2,signif="pvalue",l2fc.threashold=0.3)
## Differential expression analysis with DESeq2!

# GenePlot for edgeR
GenePlot(deobj = snon.edgeR,deres = all.res,group.key = "condition",
         ref.group = "WT",fill.color=c("red","blue"),fill.alpha = 0.8,
         gene.num =2,signif="PValue",l2fc.threashold=0.3)
## Differential expression analysis with edgeR!

6.2.6 DEHeatmap

# DEHeatmap for DESeq2
DEHeatmap(deobj = dds,deres = dds.results.ordered,group.key = "condition",
          ref.group = "WT", signif="pvalue",l2fc.threashold=0.3)
## Differential expression analysis with DESeq2!

# DEHeatmap for edgeR
DEHeatmap(deobj = snon.edgeR,deres = all.res,group.key = "condition",
          ref.group = "WT", signif="PValue",l2fc.threashold=0.3)
## Differential expression analysis with edgeR!


7 Functional Enrichment Analysis

7.1 GO enrichment

# save results to working directory
# ConductFE(deres = dds.results.ordered, signif = "pvalue", l2fc.threashold = 0.3,enrich.type = "GO")
all.go.results=ConductFE(deres = dds.results.ordered, signif = "pvalue", l2fc.threashold = 0.3,enrich.type = "GO",str.width = 50, save = F)
## Differential expression analysis with DESeq2!
## Convert ENSEMBL to ENTREZID!
## 'select()' returned 1:many mapping between keys and columns
## conduct ALL GO enrichment analysis on: UP
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Convert ENSEMBL to ENTREZID!
## 'select()' returned 1:1 mapping between keys and columns
## conduct ALL GO enrichment analysis on: DOWN
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

7.1.1 Up regulated

7.1.1.1 overview

up.go.res=all.go.results[["UP"]][["GO"]]
up.go.res.table=up.go.res[["table"]]
head(up.go.res.table)
##            ONTOLOGY         ID                              Description
## GO:0097060       CC GO:0097060                        synaptic membrane
## GO:0099240       CC GO:0099240 intrinsic component of synaptic membrane
## GO:0014069       CC GO:0014069                     postsynaptic density
## GO:0032279       CC GO:0032279                       asymmetric synapse
## GO:0098839       CC GO:0098839            postsynaptic density membrane
## GO:0098984       CC GO:0098984                 neuron to neuron synapse
##            GeneRatio   BgRatio       pvalue   p.adjust     qvalue
## GO:0097060    15/231 470/23271 7.783967e-05 0.01822125 0.01447238
## GO:0099240    10/231 244/23271 1.732798e-04 0.01822125 0.01447238
## GO:0014069    13/231 400/23271 1.957825e-04 0.01822125 0.01447238
## GO:0032279    13/231 405/23271 2.208636e-04 0.01822125 0.01447238
## GO:0098839     6/231  97/23271 4.175021e-04 0.01868255 0.01483877
## GO:0098984    13/231 433/23271 4.183638e-04 0.01868255 0.01483877
##                                                                                                  geneID
## GO:0097060 Slc6a2/Met/Nt5e/Lpar2/Grik1/Tenm2/Tiam1/Abhd6/Rgs7bp/Epha7/Snap25/Atp2b2/Adcy1/Adgrl2/Cacng2
## GO:0099240                             Lpar2/Grik1/Abhd6/Rgs7bp/Epha7/Snap25/Atp2b2/Adcy1/Adgrl2/Cacng2
## GO:0014069              Nefh/Met/Grik1/Rapgef4/Tiam1/Rgs7bp/Epha7/Ctnnd2/Map1b/Atp2b2/Pak2/Adcy1/Cacng2
## GO:0032279              Nefh/Met/Grik1/Rapgef4/Tiam1/Rgs7bp/Epha7/Ctnnd2/Map1b/Atp2b2/Pak2/Adcy1/Cacng2
## GO:0098839                                                       Tiam1/Rgs7bp/Epha7/Atp2b2/Adcy1/Cacng2
## GO:0098984              Nefh/Met/Grik1/Rapgef4/Tiam1/Rgs7bp/Epha7/Ctnnd2/Map1b/Atp2b2/Pak2/Adcy1/Cacng2
##            Count
## GO:0097060    15
## GO:0099240    10
## GO:0014069    13
## GO:0032279    13
## GO:0098839     6
## GO:0098984    13

7.1.1.2 plot

up.go.res[["plot"]]


7.1.2 Down regulated

7.1.2.1 overview

down.go.res=all.go.results[["DOWN"]][["GO"]]
down.go.res.table = down.go.res[["table"]]
head(down.go.res.table)
##            ONTOLOGY         ID       Description GeneRatio   BgRatio
## GO:0005604       CC GO:0005604 basement membrane     7/229 114/23271
##                  pvalue   p.adjust     qvalue
## GO:0005604 0.0001374573 0.04178703 0.04007966
##                                                geneID Count
## GO:0005604 Ntn4/Ntn1/Col4a2/Egflam/Papln/Fras1/Col4a4     7

7.1.2.2 plot

down.go.res[["plot"]]


7.2 KEGG enrichment

# save results to working directory
# ConductFE(deres = dds.results.ordered, signif = "pvalue", l2fc.threashold = 0.3,enrich.type = "KEGG")
all.kegg.results=ConductFE(deres = dds.results.ordered, signif = "pvalue", l2fc.threashold = 0.3,enrich.type = "KEGG", str.width = 50, save = F)
## Differential expression analysis with DESeq2!
## Convert ENSEMBL to ENTREZID!
## 'select()' returned 1:many mapping between keys and columns
## conduct KEGG enrichment analysis.
## Reading KEGG annotation online:
## 
## Reading KEGG annotation online:
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Convert ENSEMBL to ENTREZID!
## 'select()' returned 1:1 mapping between keys and columns
## conduct KEGG enrichment analysis.

7.2.1 Up

7.2.1.1 overview

up.kegg.res=all.kegg.results[["UP"]][["KEGG"]]
up.kegg.res.table=up.kegg.res[["table"]]
head(up.kegg.res.table)
##                ID            Description GeneRatio  BgRatio       pvalue
## mmu04911 mmu04911      Insulin secretion     7/100  86/8995 4.452578e-05
## mmu04024 mmu04024 cAMP signaling pathway    10/100 220/8995 1.580727e-04
##             p.adjust      qvalue
## mmu04911 0.009751146 0.008623941
## mmu04024 0.017308964 0.015308096
##                                                                    geneID Count
## mmu04911                  Kcnmb2/Kcnn3/Creb5/Rapgef4/Creb3l2/Snap25/Adcy1     7
## mmu04024 Cftr/Atp2a3/Creb5/Rapgef4/Tiam1/Creb3l2/Sox9/Atp2b2/Crebbp/Adcy1    10

7.2.1.2 plot

up.kegg.res[["plot"]]


7.2.2 Down

7.2.2.1 overview

down.kegg.res=all.kegg.results[["DOWN"]][["KEGG"]]
down.kegg.res.table=down.kegg.res[["table"]]
head(down.kegg.res.table)
## [1] ID          Description GeneRatio   BgRatio     pvalue      p.adjust   
## [7] qvalue      geneID      Count      
## <0 rows> (or 0-length row.names)

7.2.2.2 plot

down.kegg.res[["plot"]]


7.3 GSEA

For human, we can download gmt file from MSigDB. For other species, we can obtain gene sets via msigdbr.

7.3.1 msigdbr gene sets

library(msigdbr)
# list all possible species
msigdbr_species()
## # A tibble: 20 x 2
##    species_name                species_common_name                              
##    <chr>                       <chr>                                            
##  1 Anolis carolinensis         Carolina anole, green anole                      
##  2 Bos taurus                  bovine, cattle, cow, dairy cow, domestic cattle,…
##  3 Caenorhabditis elegans      roundworm                                        
##  4 Canis lupus familiaris      dog, dogs                                        
##  5 Danio rerio                 leopard danio, zebra danio, zebra fish, zebrafish
##  6 Drosophila melanogaster     fruit fly                                        
##  7 Equus caballus              domestic horse, equine, horse                    
##  8 Felis catus                 cat, cats, domestic cat                          
##  9 Gallus gallus               bantam, chicken, chickens, Gallus domesticus     
## 10 Homo sapiens                human                                            
## 11 Macaca mulatta              rhesus macaque, rhesus macaques, Rhesus monkey, …
## 12 Monodelphis domestica       gray short-tailed opossum                        
## 13 Mus musculus                house mouse, mouse                               
## 14 Ornithorhynchus anatinus    duck-billed platypus, duckbill platypus, platypus
## 15 Pan troglodytes             chimpanzee                                       
## 16 Rattus norvegicus           brown rat, Norway rat, rat, rats                 
## 17 Saccharomyces cerevisiae    baker's yeast, brewer's yeast, S. cerevisiae     
## 18 Schizosaccharomyces pombe … <NA>                                             
## 19 Sus scrofa                  pig, pigs, swine, wild boar                      
## 20 Xenopus tropicalis          tropical clawed frog, western clawed frog
# create gene sets
m_t2g <- msigdbr(species = "Mus musculus", category = "C5", subcategory = "BP") %>% 
  dplyr::select(gs_name, entrez_gene)
head(m_t2g)
## # A tibble: 6 x 2
##   gs_name                                          entrez_gene
##   <chr>                                                  <int>
## 1 GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS       67618
## 2 GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS      107747
## 3 GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS      216188
## 4 GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS      108156
## 5 GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS      270685
## 6 GOBP_10_FORMYLTETRAHYDROFOLATE_METABOLIC_PROCESS      665563

7.3.2 Conduct GSEA

gsea.results=ConductGSEA(deres = dds.results.ordered,gmt.file = NULL,gene.sets = m_t2g,save = F)
## Differential expression analysis with DESeq2!
## Convert ENSEMBL to ENTREZID!
## 'select()' returned 1:many mapping between keys and columns
## conduct GSEA anaysis.
## preparing geneSet collections...
## GSEA analysis...
## no term enriched under specific pvalueCutoff...

7.3.3 results overview

gsea.results.table=gsea.results[["table"]]
head(gsea.results.table)
## [1] ID              Description     setSize         enrichmentScore
## [5] NES             pvalue          p.adjust        qvalues        
## <0 rows> (or 0-length row.names)

7.3.4 results plot

gsea.results[["plot"]]


8 All-in-one analysis

To enhance the usability of the tool, we create a all-in-one analysis command, all results will be saved in specific directory:

library(msigdbr)
m_t2g <- msigdbr(species = "Mus musculus", category = "C5") %>%
  dplyr::select(gs_name, entrez_gene)
count.file <- system.file("extdata", "snon_count.txt", package = "DEbChIP")
meta.file <- system.file("extdata", "snon_meta.txt", package = "DEbChIP")
ConductDESeq2(count.matrix.file = count.file, meta.file = meta.file, out.folder="/home/songyabing/R/learn/tmp/dekit",
              signif = "pvalue", l2fc.threashold = 0.3,
              group.key = "condition", ref.group = "WT",gmt.file = NULL, gene.sets = m_t2g)

The structrue of results folder:

tree /home/songyabing/R/learn/tmp/dekit
## /home/songyabing/R/learn/tmp/dekit
## ├── CountQC_CPM.pdf
## ├── CountQC_saturation.pdf
## ├── DEG
## │   ├── Condition_KO_WT_all.csv
## │   ├── Condition_KO_WT_pvalue0.05_FC0.3.csv
## │   ├── DEG_GenePlot.pdf
## │   ├── DEG_Heatmap.pdf
## │   ├── DEG_MAPlot.pdf
## │   ├── DEG_RankPlot.pdf
## │   ├── DEG_ScatterPlot.pdf
## │   ├── DEG_VolcanoPlot.pdf
## │   └── normalized_counts.csv
## ├── DEkit_all_in_one.RData
## ├── FE
## │   ├── DOWN_ALL_GO.csv
## │   ├── DOWN_Biological_Process.png
## │   ├── DOWN_Cellular_Component.png
## │   ├── DOWN_Functional_Enrichment.pdf
## │   ├── DOWN_KEGG.csv
## │   ├── DOWN_KEGG_Enrichment.png
## │   ├── DOWN_Molecular_Function.png
## │   ├── UP_ALL_GO.csv
## │   ├── UP_Biological_Process.png
## │   ├── UP_Cellular_Component.png
## │   ├── UP_Functional_Enrichment.pdf
## │   ├── UP_KEGG.csv
## │   ├── UP_KEGG_Enrichment.png
## │   └── UP_Molecular_Function.png
## ├── GSEA
## │   ├── GSEA_enrich_result.csv
## │   ├── GSEA_enrich_result.pdf
## │   └── GSEA_enrich_result.png
## ├── PCA
## │   ├── PC1
## │   │   ├── Negative
## │   │   │   ├── PC1_Negative_ALL_GO.csv
## │   │   │   └── PC1_Negative_KEGG.csv
## │   │   └── Positive
## │   │       ├── PC1_Positive_ALL_GO.csv
## │   │       └── PC1_Positive_KEGG.csv
## │   ├── PC1_Negative_Biological_Process.png
## │   ├── PC1_Negative_Cellular_Component.png
## │   ├── PC1_Negative_Functional_Enrichment.pdf
## │   ├── PC1_Negative_KEGG_Enrichment.png
## │   ├── PC1_Negative_Molecular_Function.png
## │   ├── PC1_Positive_Biological_Process.png
## │   ├── PC1_Positive_Cellular_Component.png
## │   ├── PC1_Positive_Functional_Enrichment.pdf
## │   ├── PC1_Positive_KEGG_Enrichment.png
## │   ├── PC1_Positive_Molecular_Function.png
## │   ├── PC2
## │   │   ├── Negative
## │   │   │   ├── PC2_Negative_ALL_GO.csv
## │   │   │   └── PC2_Negative_KEGG.csv
## │   │   └── Positive
## │   │       ├── PC2_Positive_ALL_GO.csv
## │   │       └── PC2_Positive_KEGG.csv
## │   ├── PC2_Negative_Biological_Process.png
## │   ├── PC2_Negative_Cellular_Component.png
## │   ├── PC2_Negative_Functional_Enrichment.pdf
## │   ├── PC2_Negative_KEGG_Enrichment.png
## │   ├── PC2_Negative_Molecular_Function.png
## │   ├── PC2_Positive_Biological_Process.png
## │   ├── PC2_Positive_Cellular_Component.png
## │   ├── PC2_Positive_Functional_Enrichment.pdf
## │   ├── PC2_Positive_KEGG_Enrichment.png
## │   ├── PC2_Positive_Molecular_Function.png
## │   ├── PC3
## │   │   ├── Negative
## │   │   │   ├── PC3_Negative_ALL_GO.csv
## │   │   │   └── PC3_Negative_KEGG.csv
## │   │   └── Positive
## │   │       ├── PC3_Positive_ALL_GO.csv
## │   │       └── PC3_Positive_KEGG.csv
## │   ├── PC3_Negative_Biological_Process.png
## │   ├── PC3_Negative_Cellular_Component.png
## │   ├── PC3_Negative_Functional_Enrichment.pdf
## │   ├── PC3_Negative_KEGG_Enrichment.png
## │   ├── PC3_Negative_Molecular_Function.png
## │   ├── PC3_Positive_Biological_Process.png
## │   ├── PC3_Positive_Cellular_Component.png
## │   ├── PC3_Positive_Functional_Enrichment.pdf
## │   ├── PC3_Positive_KEGG_Enrichment.png
## │   ├── PC3_Positive_Molecular_Function.png
## │   ├── PC4
## │   │   ├── Negative
## │   │   │   ├── PC4_Negative_ALL_GO.csv
## │   │   │   └── PC4_Negative_KEGG.csv
## │   │   └── Positive
## │   │       ├── PC4_Positive_ALL_GO.csv
## │   │       └── PC4_Positive_KEGG.csv
## │   ├── PC4_Negative_Biological_Process.png
## │   ├── PC4_Negative_Cellular_Component.png
## │   ├── PC4_Negative_Functional_Enrichment.pdf
## │   ├── PC4_Negative_KEGG_Enrichment.png
## │   ├── PC4_Negative_Molecular_Function.png
## │   ├── PC4_Positive_Biological_Process.png
## │   ├── PC4_Positive_Cellular_Component.png
## │   ├── PC4_Positive_Functional_Enrichment.pdf
## │   ├── PC4_Positive_KEGG_Enrichment.png
## │   ├── PC4_Positive_Molecular_Function.png
## │   ├── PC5
## │   │   ├── Negative
## │   │   │   ├── PC5_Negative_ALL_GO.csv
## │   │   │   └── PC5_Negative_KEGG.csv
## │   │   └── Positive
## │   │       ├── PC5_Positive_ALL_GO.csv
## │   │       └── PC5_Positive_KEGG.csv
## │   ├── PC5_Negative_Biological_Process.png
## │   ├── PC5_Negative_Cellular_Component.png
## │   ├── PC5_Negative_Functional_Enrichment.pdf
## │   ├── PC5_Negative_KEGG_Enrichment.png
## │   ├── PC5_Negative_Molecular_Function.png
## │   ├── PC5_Positive_Biological_Process.png
## │   ├── PC5_Positive_Cellular_Component.png
## │   ├── PC5_Positive_Functional_Enrichment.pdf
## │   ├── PC5_Positive_KEGG_Enrichment.png
## │   ├── PC5_Positive_Molecular_Function.png
## │   ├── PCA_3DPCA.pdf
## │   ├── PCA_biplot.pdf
## │   ├── PCA_loading_bar.pdf
## │   ├── PCA_loading_heat.pdf
## │   ├── PCA_overview.pdf
## │   ├── PCA_pairs_plot.pdf
## │   └── PCA_screen_plot.pdf
## └── SampleQC_dist_pcc.pdf
## 
## 19 directories, 107 files

9 Integrate with ChIP-seq

The data used here contains RNA-seq and ChIP-seq datasets from RUNX represses Pmp22 to drive neurofibromagenesis:

  • RNA-seq: two genotypes and three samples per genotype, the raw data are stored in GSE122774
  • ChIP-seq: two genotypes and one sample per genotype, the raw data are stored in GSE122775

9.1 Prepare DEGs

# prepare count matrix and metadata
debchip.count.file <- system.file("extdata", "debchip_count.txt", package = "DEbChIP")
debchip.meta.file <- system.file("extdata", "debchip_meta.txt", package = "DEbChIP")
debchip.count.matrix <- read.table(file = debchip.count.file, header = T, sep = "\t")
debchip.meta.info <- read.table(file = debchip.meta.file, header = T)
# create DESeqDataSet object
debchip.dds <- DESeq2::DESeqDataSetFromMatrix(countData = debchip.count.matrix, 
                                              colData = debchip.meta.info, 
                                              design = ~condition)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
# set control level
debchip.dds$condition <- relevel(debchip.dds$condition, ref = "NF")
# conduct differential expressed genes analysis
debchip.dds <- DESeq(debchip.dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# extract results
debchip.dds.results <- results(debchip.dds,contrast=c("condition",'RX','NF'))
debchip.dds.results.ordered <- debchip.dds.results[order(debchip.dds.results$log2FoldChange,decreasing = TRUE),]
head(debchip.dds.results.ordered)
## log2 fold change (MLE): condition RX vs NF 
## Wald test p-value: condition RX vs NF 
## DataFrame with 6 rows and 6 columns
##          baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##         <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Sycp1    12.07541        7.15056   3.91075   1.82844 6.74840e-02 1.81006e-01
## Gm16532  10.07020        6.89407   1.74281   3.95573 7.63014e-05 7.17146e-04
## Hs3st4   18.88787        6.82852   1.42952   4.77679 1.78117e-06 2.83848e-05
## Nell1    34.13399        6.80718   1.34257   5.07027 3.97260e-07 7.79916e-06
## Sptssb    9.39376        6.78558   1.52948   4.43651 9.14275e-06 1.16684e-04
## Myo3a     7.19992        6.41077   1.74865   3.66612 2.46259e-04 1.94151e-03

9.2 Prepare ChIP-seq data

9.2.1 Consensus peaks

In this step, we will get consensus peaks with MSPC when multiple peak files are available, but when there is only one peak file, we will use it directly (make sure this peak file contains five columns: “chr”, “start”, “stop”, “name”, “score”).

# get consensus peak
peak.file = system.file("extdata", "debchip_peaks.bed", package = "DEbChIP")
peak.df = GetConsensusPeak(peak.file = peak.file)
head(peak.df)
##     chr     start      stop     name score
## 1 chr13  51519018  51519164  chr13-5  27.6
## 2 chr13  21326999  21327145  chr13-6  25.2
## 3  chr5 151112840 151112986   chr5-2  22.3
## 4 chr11 104361891 104362037 chr11-98  21.8
## 5 chr11  70198551  70198697 chr11-88  21.3
## 6  chr6  29326952  29327098   chr6-4  20.8

9.2.2 Peak profile

Check the profle of consensus peaks:

# peak profile plot
peak.profile = PeakProfile(peak.df ,species="Mouse", by = "gene", region.type = "body", nbin = 800)
## >> preparing promoter regions...  2022-06-30 22时08分15秒 
## >> preparing tag matrix...        2022-06-30 22时08分16秒 
## >> preparing start_site regions by ... 2022-06-30 22时08分16秒
## >> preparing tag matrix...  2022-06-30 22时08分16秒 
## >> generating figure...       2022-06-30 22时08分27秒
## >> done...            2022-06-30 22时08分27秒
## >> binning method is used...2022-06-30 22时08分27秒
## >> preparing start_site regions by gene... 2022-06-30 22时08分27秒
## >> preparing tag matrix by binning...  2022-06-30 22时08分27秒 
## >> Running bootstrapping for tag matrix...        2022-06-30 22时08分33秒 
## >> binning method is used...2022-06-30 22时08分34秒
## >> preparing body regions by gene... 2022-06-30 22时08分34秒
## >> preparing tag matrix by binning...  2022-06-30 22时08分34秒 
## >> preparing matrix with extension from (TSS-20%)~(TTS+20%)... 2022-06-30 22时08分34秒
## >> 1 peaks(0.1536098%), having lengths smaller than 800bp, are filtered... 2022-06-30 22时08分37秒
## >> Running bootstrapping for tag matrix...        2022-06-30 22时09分15秒
peak.profile$profile.plot


9.2.3 Peak annotation

Peak annotation with ChIPseeker:

# peak annotation
peak.anno = AnnoPeak(peak.df = peak.df,species = "Mouse",seq.style = "UCSC",up.dist = 20000,down.dist = 20000)
## >> preparing features information...      2022-06-30 22时09分17秒 
## >> identifying nearest features...        2022-06-30 22时09分17秒 
## >> calculating distance from peak to TSS...   2022-06-30 22时09分17秒 
## >> assigning genomic annotation...        2022-06-30 22时09分17秒 
## >> adding gene annotation...          2022-06-30 22时09分29秒
## 'select()' returned 1:many mapping between keys and columns
## >> assigning chromosome lengths           2022-06-30 22时09分29秒 
## >> done...                    2022-06-30 22时09分30秒
## Warning: Removed 6 rows containing non-finite values (stat_count).
peak.anno.df = peak.anno$df
head(peak.anno.df)
##   seqnames     start       end width strand     name score
## 1    chr13  51519019  51519164   146      *  chr13-5  27.6
## 2    chr13  21327000  21327145   146      *  chr13-6  25.2
## 3     chr5 151112841 151112986   146      *   chr5-2  22.3
## 4    chr11 104361892 104362037   146      * chr11-98  21.8
## 5    chr11  70198552  70198697   146      * chr11-88  21.3
## 6     chr6  29326953  29327098   146      *   chr6-4  20.8
##                                            annotation geneChr geneStart
## 1 Intron (ENSMUST00000021898.5/20418, intron 1 of 11)      13  51431041
## 2                                   Promoter (9-10kb)      13  21317258
## 3                                    Promoter (4-5kb)       5 151095421
## 4 Intron (ENSMUST00000106977.7/76719, intron 4 of 13)      11 104334921
## 5                                  Promoter (14-15kb)      11  70212752
## 6                                    Promoter (7-8kb)       6  29319199
##     geneEnd geneLength geneStrand geneId         transcriptId distanceToTSS
## 1  51567084     136044          2  20418 ENSMUST00000021898.5         47920
## 2  21319624       2367          1  75512 ENSMUST00000136668.1          9742
## 3 151108735      13315          2 243362 ENSMUST00000202866.1         -4106
## 4 104341299       6379          2  76719 ENSMUST00000069188.6        -20593
## 5  70216413       3662          1 216867 ENSMUST00000126388.7        -14055
## 6  29335854      16656          1 330277 ENSMUST00000166462.1          7754
##              ENSEMBL   SYMBOL
## 1 ENSMUSG00000021448     Shc3
## 2 ENSMUSG00000004341     Gpx6
## 3 ENSMUSG00000016128  Stard13
## 4 ENSMUSG00000018412   Kansl1
## 5 ENSMUSG00000040938 Slc16a11
## 6 ENSMUSG00000039742  Fam71f1
##                                                                 GENENAME
## 1               src homology 2 domain-containing transforming protein C3
## 2                                               glutathione peroxidase 6
## 3               StAR-related lipid transfer (START) domain containing 13
## 4                                  KAT8 regulatory NSL complex subunit 1
## 5 solute carrier family 16 (monocarboxylic acid transporters), member 11
## 6                          family with sequence similarity 71, member F1
##       anno
## 1   Intron
## 2 Promoter
## 3 Promoter
## 4   Intron
## 5 Promoter
## 6 Promoter
peak.anno$plots


9.3 Integrate ChIP-seq and RNA-seq

In this step, we will integrate ChIP-seq and RNA-seq to get plausible direct targets of promoter (in this example, Runx). ### Integrate

debchip.res = DEbChIP(de.res = debchip.dds.results,chip.res = peak.anno.df,chip.anno.key = "Promoter", merge.key="SYMBOL")
## Differential expression analysis with DESeq2!
head(debchip.res)
##          SYMBOL geneId         annotation     anno ENSEMBL
## 1 0610012G03Rik   <NA>               <NA>     <NA>    <NA>
## 2 1110002J07Rik  68488 Promoter (12-13kb) Promoter    <NA>
## 3 1110008P14Rik   <NA>               <NA>     <NA>    <NA>
## 4 1110032F04Rik   <NA>               <NA>     <NA>    <NA>
## 5 1500009C09Rik   <NA>               <NA>     <NA>    <NA>
## 6 1500011B03Rik   <NA>               <NA>     <NA>    <NA>
##                     GENENAME log2FoldChange  abundance   signif   regulation
## 1                       <NA>       1.837668  59.179193 3.073773 Up_regulated
## 2 RIKEN cDNA 1110002J07 gene             NA         NA       NA         <NA>
## 3                       <NA>       1.013546 125.765543 2.277566 Up_regulated
## 4                       <NA>       2.935223  89.091244 4.714179 Up_regulated
## 5                       <NA>       3.183848   9.432253 1.492741 Up_regulated
## 6                       <NA>       1.058984 227.077738 2.656922 Up_regulated
##   Type
## 1   UP
## 2 ChIP
## 3   UP
## 4   UP
## 5   UP
## 6   UP

9.3.1 Integrate summary

# DE and ChIP venn plot
debchip.plot = PlotDEbChIP(debchip.res,show_percentage=FALSE)
debchip.plot


9.3.2 Functional enrichment

# functional enrichment on direct targets
debchip.fe.results = DEbChIPFE(de.chip = debchip.res,gene.type = "ENTREZID",species="Mouse",save = F)
## conduct ALL GO enrichment analysis on: UPbChIP
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## conduct ALL GO enrichment analysis on: DOWNbChIP
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## wrong orderBy parameter; set to default `orderBy = "x"`
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

9.3.2.1 FE on up-regulated targets

up.debchip.fe.results=debchip.fe.results[["UPbChIP"]][["GO"]]
head(up.debchip.fe.results[["table"]])
##            ONTOLOGY         ID                                    Description
## GO:0032288       BP GO:0032288                                myelin assembly
## GO:0019233       BP GO:0019233                     sensory perception of pain
## GO:0043951       BP GO:0043951 negative regulation of cAMP-mediated signaling
## GO:0022011       BP GO:0022011       myelination in peripheral nervous system
## GO:0032292       BP GO:0032292    peripheral nervous system axon ensheathment
## GO:0014044       BP GO:0014044                       Schwann cell development
##            GeneRatio   BgRatio       pvalue   p.adjust     qvalue
## GO:0032288      2/21  23/23328 0.0001930551 0.02159856 0.01416299
## GO:0019233      3/21 146/23328 0.0002940462 0.02159856 0.01416299
## GO:0043951      2/21  30/23328 0.0003306741 0.02159856 0.01416299
## GO:0022011      2/21  31/23328 0.0003532875 0.02159856 0.01416299
## GO:0032292      2/21  31/23328 0.0003532875 0.02159856 0.01416299
## GO:0014044      2/21  34/23328 0.0004255310 0.02159856 0.01416299
##                      geneID Count
## GO:0032288        Pmp22/Prx     2
## GO:0019233 Kcnip3/Npy2r/Prx     3
## GO:0043951     Npy2r/Rnf157     2
## GO:0022011        Pmp22/Prx     2
## GO:0032292        Pmp22/Prx     2
## GO:0014044        Pmp22/Prx     2
up.debchip.fe.results[["plot"]]

9.3.2.2 FE on down-regulated targets

down.debchip.fe.results=debchip.fe.results[["DOWNbChIP"]][["GO"]]
head(down.debchip.fe.results[["table"]])
##            ONTOLOGY         ID                                  Description
## GO:0010810       BP GO:0010810        regulation of cell-substrate adhesion
## GO:0045785       BP GO:0045785         positive regulation of cell adhesion
## GO:0007162       BP GO:0007162         negative regulation of cell adhesion
## GO:0030198       BP GO:0030198            extracellular matrix organization
## GO:0043062       BP GO:0043062         extracellular structure organization
## GO:0051250       BP GO:0051250 negative regulation of lymphocyte activation
##            GeneRatio   BgRatio       pvalue   p.adjust      qvalue
## GO:0010810      5/33 214/23328 1.193628e-05 0.01428434 0.008185655
## GO:0045785      6/33 435/23328 2.935433e-05 0.01428434 0.008185655
## GO:0007162      5/33 290/23328 5.117751e-05 0.01428434 0.008185655
## GO:0030198      5/33 302/23328 6.201521e-05 0.01428434 0.008185655
## GO:0043062      5/33 303/23328 6.299252e-05 0.01428434 0.008185655
## GO:0051250      4/33 158/23328 7.112535e-05 0.01428434 0.008185655
##                                             geneID Count
## GO:0010810           Col8a1/Fbln2/Mmp14/Postn/Sdc4     5
## GO:0045785 Col8a1/Fbln2/Il4ra/Ptpn22/Sdc4/Tnfsf13b     6
## GO:0007162           Il4ra/Mmp14/Postn/Ptpn22/Sdc4     5
## GO:0030198           Bcl3/Col8a1/Fbln2/Mmp14/Postn     5
## GO:0043062           Bcl3/Col8a1/Fbln2/Mmp14/Postn     5
## GO:0051250                   Il4ra/Lyn/Ptpn22/Sdc4     4
down.debchip.fe.results[["plot"]]


10 Utils

10.1 Gene name conversion

10.1.1 results from DE analysis

# convert for DESeq2 results
dds.results.ordered <- IDConversion(deres = dds.results.ordered, gene.type = "ENSEMBL",sort.key = "log2FoldChange")
## 'select()' returned 1:many mapping between keys and columns
head(dds.results.ordered)
##                           baseMean log2FoldChange     lfcSE       stat
## ENSMUSG00000000001.4  4178.2932143     0.03256101 0.1097033  0.2968099
## ENSMUSG00000000003.15    0.0000000             NA        NA         NA
## ENSMUSG00000000028.14  416.5975492    -0.13238461 0.1709810 -0.7742650
## ENSMUSG00000000031.16  200.5762259     0.57949970 0.7478555  0.7748819
## ENSMUSG00000000037.16  202.0237326     0.13726485 0.2659695  0.5160925
## ENSMUSG00000000049.11    0.3930718    -2.07400983 2.6239287 -0.7904216
##                          pvalue      padj ENTREZID SYMBOL
## ENSMUSG00000000001.4  0.7666117 0.9997304    14679  Gnai3
## ENSMUSG00000000003.15        NA        NA    54192   Pbsn
## ENSMUSG00000000028.14 0.4387741 0.9997304    12544  Cdc45
## ENSMUSG00000000031.16 0.4384095 0.9997304    14955    H19
## ENSMUSG00000000037.16 0.6057898 0.9997304   107815  Scml2
## ENSMUSG00000000049.11 0.4292816 0.9997304    11818   Apoh

10.1.2 normal matrix

count.matrix <- IDConversion(deres = count.matrix, gene.type = "ENSEMBL",sort.key=NULL)
## 'select()' returned 1:many mapping between keys and columns
head(count.matrix)
##                        KO1  KO2  KO3  KO4  KO5  KO6  WT1  WT2  WT3  WT4  WT5
## ENSMUSG00000000001.4  4556 4218 3835 3718 5741 3875 4115 5074 2931 4374 5118
## ENSMUSG00000000003.15    0    0    0    0    0    0    0    0    0    0    0
## ENSMUSG00000000028.14  350  579  435  316  432  317  245  621  419  506  545
## ENSMUSG00000000031.16  268  804   66  207   46   66  336  112   60   69  137
## ENSMUSG00000000037.16  262  157  184  162  301  233  311  176   94  139  229
## ENSMUSG00000000049.11    0    0    0    0    0    0    1    0    2    0    1
##                        WT6 ENTREZID SYMBOL
## ENSMUSG00000000001.4  3625    14679  Gnai3
## ENSMUSG00000000003.15    0    54192   Pbsn
## ENSMUSG00000000028.14  371    12544  Cdc45
## ENSMUSG00000000031.16  193    14955    H19
## ENSMUSG00000000037.16  186   107815  Scml2
## ENSMUSG00000000049.11    0    11818   Apoh

10.2 Count normalization

Here, we provide five different normalization methods: CPM, TMM, DESeq2’s median of ratios, RPKM and TPM, of which RPKM and TPM need to provide gtf files to calculate gene length.

10.2.1 CPM

Counts Per Million reads mapped (CPM) takes into account the sequencing depth. For each feature i, CPM is the count of sequenced fragments mapping to the feature scaled by the total number of reads times one million (to bring it up to a more convenient number).

\[ \mathrm{CPM}=\frac{\text { Number of reads mapped to gene } \times 10^{6}}{\text { Total number of mapped reads }} \]

cpm.matrix=NormalizedCount(dds, norm.type="CPM")
head(cpm.matrix)
##                             KO1       KO2        KO3       KO4        KO5
## ENSMUSG00000000001.4  405.92819 400.64944 389.764529 377.62444 380.625026
## ENSMUSG00000000003.15   0.00000   0.00000   0.000000   0.00000   0.000000
## ENSMUSG00000000028.14  31.18412  54.99669  44.210579  32.09503  28.641354
## ENSMUSG00000000031.16  23.87813  76.36846   6.707812  21.02428   3.049774
## ENSMUSG00000000037.16  23.34354  14.91275  18.700567  16.45378  19.956128
## ENSMUSG00000000049.11   0.00000   0.00000   0.000000   0.00000   0.000000
##                              KO6         WT1        WT2         WT3        WT4
## ENSMUSG00000000001.4  362.974809 424.0865495 343.046563 389.7566993 402.307594
## ENSMUSG00000000003.15   0.000000   0.0000000   0.000000   0.0000000   0.000000
## ENSMUSG00000000028.14  29.693681  25.2493814  41.985005  55.7175220  46.540385
## ENSMUSG00000000031.16   6.182281  34.6277231   7.572175   7.9786428   6.346416
## ENSMUSG00000000037.16  21.825324  32.0512556  11.899132  12.4998737  12.784809
## ENSMUSG00000000049.11   0.000000   0.1030587   0.000000   0.2659548   0.000000
##                                WT5       WT6
## ENSMUSG00000000001.4  383.64870911 355.77287
## ENSMUSG00000000003.15   0.00000000   0.00000
## ENSMUSG00000000028.14  40.85356516  36.41151
## ENSMUSG00000000031.16  10.26961179  18.94184
## ENSMUSG00000000037.16  17.16599343  18.25483
## ENSMUSG00000000049.11   0.07496067   0.00000

10.2.2 TMM

edgeR’s Trimmed Mean of M values (TMM) takes into account the sequencing depth, RNA composition, and gene length. For detailed information, please refer to A scaling normalization method for differential expression analysis of RNA-seq data.

tmm.matrix=NormalizedCount(dds, norm.type="TMM")
head(tmm.matrix)
##                             KO1       KO2        KO3       KO4        KO5
## ENSMUSG00000000001.4  421.83331 371.93907 371.189992 382.69198 413.745541
## ENSMUSG00000000003.15   0.00000   0.00000   0.000000   0.00000   0.000000
## ENSMUSG00000000028.14  32.40598  51.05565  42.103689  32.52573  31.133613
## ENSMUSG00000000031.16  24.81372  70.89592   6.388146  21.30641   3.315153
## ENSMUSG00000000037.16  24.25819  13.84410  17.809376  16.67458  21.692633
## ENSMUSG00000000049.11   0.00000   0.00000   0.000000   0.00000   0.000000
##                              KO6         WT1        WT2         WT3       WT4
## ENSMUSG00000000001.4  373.041177 486.1604573 334.563934 361.3055744 394.46383
## ENSMUSG00000000003.15   0.000000   0.0000000   0.000000   0.0000000   0.00000
## ENSMUSG00000000028.14  30.517175  28.9451548  40.946828  51.6503022  45.63299
## ENSMUSG00000000031.16   6.353734  39.6962123   7.384935   7.3962247   6.22268
## ENSMUSG00000000037.16  22.430605  36.7426251  11.604898  11.5874186  12.53554
## ENSMUSG00000000049.11   0.000000   0.1181435   0.000000   0.2465408   0.00000
##                                WT5       WT6
## ENSMUSG00000000001.4  371.61142274 347.48647
## ENSMUSG00000000003.15   0.00000000   0.00000
## ENSMUSG00000000028.14  39.57175174  35.56344
## ENSMUSG00000000031.16   9.94739447  18.50066
## ENSMUSG00000000037.16  16.62739660  17.82965
## ENSMUSG00000000049.11   0.07260872   0.00000

10.2.3 DESeq2’s median of ratios

DESeq2’s median of ratios takes into account the sequencing depth and RNA composition. For detailed information, please refer to Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.

deseq2.matrix=NormalizedCount(dds, norm.type="DESeq2")
head(deseq2.matrix)
##                             KO1       KO2        KO3       KO4        KO5
## ENSMUSG00000000001.4  4606.4055 4014.0575 4037.53736 4145.2746 4476.42605
## ENSMUSG00000000003.15    0.0000    0.0000    0.00000    0.0000    0.00000
## ENSMUSG00000000028.14  353.8722  551.0050  457.97360  352.3149  336.84307
## ENSMUSG00000000031.16  270.9650  765.1262   69.48565  230.7886   35.86755
## ENSMUSG00000000037.16  264.8986  149.4090  193.71757  180.6171  234.69853
## ENSMUSG00000000049.11    0.0000    0.0000    0.00000    0.0000    0.00000
##                              KO6         WT1        WT2         WT3       WT4
## ENSMUSG00000000001.4  4069.51479 5285.859380 3591.55792 3889.896093 4266.4884
## ENSMUSG00000000003.15    0.00000    0.000000    0.00000    0.000000    0.0000
## ENSMUSG00000000028.14  332.91257  314.710947  439.56592  556.078630  493.5627
## ENSMUSG00000000031.16   69.31303  431.603585   79.27759   79.629398   67.3040
## ENSMUSG00000000037.16  244.69599  399.490223  124.57907  124.752724  135.5834
## ENSMUSG00000000049.11    0.00000    1.284534    0.00000    2.654313    0.0000
##                                WT5       WT6
## ENSMUSG00000000001.4  3981.8764585 3774.6246
## ENSMUSG00000000003.15    0.0000000    0.0000
## ENSMUSG00000000028.14  424.0177159  386.3133
## ENSMUSG00000000031.16  106.5879396  200.9662
## ENSMUSG00000000037.16  178.1652421  193.6773
## ENSMUSG00000000049.11    0.7780142    0.0000

10.2.4 RPKM

Reads Per Kilobase of exon per Million reads mapped (RPKM) takes into account the sequencing depth and gene length. For each feature i as the count scaled by the feature’s length times one thousand (to kilobase) and further scaled by the total number of reads times one million.

\[ \text { RPKM }=\frac{\text { Number of reads mapped to gene } \times 10^{3} \times 10^{6}}{\text { Total number of mapped reads } \times \text { gene length }} \]

rpkm.matrix=NormalizedCount(dds,gtf.file = '/path/to/gtf', norm.type="RPKM")

10.2.5 TPM

Transcripts Per kilobase Million (TPM) takes into account the sequencing depth and gene length.

\[ \mathrm{TPM}=\frac{\frac{\text { total reads mapped to gene } \times 10^{3}}{\text { gene length }}}{\sum(\frac{\text { total reads mapped to gene } \times 10^{3}}{\text { gene length }})} \times 10^{6} \]

TPM is proportional to RPKM:

\[ \mathrm{TPM}=\frac{R P K M}{\sum(R P K M)} \times 10^{6} \]

tpm.matrix=NormalizedCount(dds,gtf.file = '/path/to/gtf', norm.type="TPM")

11 Session info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /home/softwares/anaconda3/envs/r4.0/lib/libopenblasp-r0.3.12.so
## 
## locale:
##  [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=zh_CN.UTF-8    
##  [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
##  [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
##  [2] GenomicFeatures_1.42.2                   
##  [3] msigdbr_7.5.1                            
##  [4] ggrepel_0.9.1                            
##  [5] ggplot2_3.3.5                            
##  [6] edgeR_3.32.1                             
##  [7] limma_3.46.0                             
##  [8] org.Mm.eg.db_3.12.0                      
##  [9] AnnotationDbi_1.52.0                     
## [10] DESeq2_1.30.1                            
## [11] SummarizedExperiment_1.20.0              
## [12] Biobase_2.50.0                           
## [13] MatrixGenerics_1.2.1                     
## [14] matrixStats_0.58.0                       
## [15] GenomicRanges_1.42.0                     
## [16] GenomeInfoDb_1.26.4                      
## [17] IRanges_2.24.1                           
## [18] S4Vectors_0.28.1                         
## [19] BiocGenerics_0.42.0                      
## [20] DEbChIP_0.1.0                            
## [21] knitr_1.37                               
## [22] BiocStyle_2.18.1                         
## 
## loaded via a namespace (and not attached):
##   [1] ggvenn_0.1.9                           
##   [2] utf8_1.2.1                             
##   [3] tidyselect_1.1.0                       
##   [4] RSQLite_2.2.5                          
##   [5] grid_4.0.3                             
##   [6] BiocParallel_1.24.1                    
##   [7] scatterpie_0.1.7                       
##   [8] munsell_0.5.0                          
##   [9] ragg_0.4.0                             
##  [10] codetools_0.2-18                       
##  [11] withr_2.4.1                            
##  [12] misc3d_0.9-1                           
##  [13] colorspace_2.0-0                       
##  [14] GOSemSim_2.16.1                        
##  [15] highr_0.8                              
##  [16] rstudioapi_0.13                        
##  [17] robustbase_0.95-0                      
##  [18] DOSE_3.16.0                            
##  [19] labeling_0.4.2                         
##  [20] GenomeInfoDbData_1.2.4                 
##  [21] polyclip_1.10-0                        
##  [22] bit64_4.0.5                            
##  [23] farver_2.1.0                           
##  [24] downloader_0.4                         
##  [25] vctrs_0.3.7                            
##  [26] generics_0.1.0                         
##  [27] xfun_0.30                              
##  [28] BiocFileCache_1.14.0                   
##  [29] R6_2.5.0                               
##  [30] doParallel_1.0.16                      
##  [31] clue_0.3-59                            
##  [32] graphlayouts_0.7.1                     
##  [33] rsvd_1.0.3                             
##  [34] locfit_1.5-9.4                         
##  [35] gridGraphics_0.5-1                     
##  [36] bitops_1.0-6                           
##  [37] cachem_1.0.4                           
##  [38] fgsea_1.16.0                           
##  [39] DelayedArray_0.16.3                    
##  [40] assertthat_0.2.1                       
##  [41] scales_1.1.1                           
##  [42] ggraph_2.0.5                           
##  [43] enrichplot_1.10.2                      
##  [44] gtable_0.3.0                           
##  [45] beachmat_2.6.4                         
##  [46] Cairo_1.5-12.2                         
##  [47] sva_3.38.0                             
##  [48] tidygraph_1.2.0                        
##  [49] rlang_0.4.10                           
##  [50] genefilter_1.72.1                      
##  [51] systemfonts_1.0.1                      
##  [52] GlobalOptions_0.1.2                    
##  [53] splines_4.0.3                          
##  [54] rtracklayer_1.50.0                     
##  [55] checkmate_2.0.0                        
##  [56] BiocManager_1.30.16                    
##  [57] yaml_2.2.1                             
##  [58] reshape2_1.4.4                         
##  [59] backports_1.2.1                        
##  [60] qvalue_2.22.0                          
##  [61] clusterProfiler_3.18.1                 
##  [62] tcltk_4.0.3                            
##  [63] tools_4.0.3                            
##  [64] bookdown_0.26                          
##  [65] ggplotify_0.1.0                        
##  [66] ellipsis_0.3.1                         
##  [67] gplots_3.1.1                           
##  [68] jquerylib_0.1.3                        
##  [69] RColorBrewer_1.1-2                     
##  [70] Rcpp_1.0.6                             
##  [71] plyr_1.8.6                             
##  [72] sparseMatrixStats_1.2.1                
##  [73] progress_1.2.2                         
##  [74] zlibbioc_1.36.0                        
##  [75] purrr_0.3.4                            
##  [76] RCurl_1.98-1.3                         
##  [77] prettyunits_1.1.1                      
##  [78] openssl_1.4.3                          
##  [79] GetoptLong_1.0.5                       
##  [80] viridis_0.6.1                          
##  [81] cowplot_1.1.1                          
##  [82] cluster_2.1.1                          
##  [83] magrittr_2.0.1                         
##  [84] data.table_1.14.2                      
##  [85] DO.db_2.9                              
##  [86] circlize_0.4.15                        
##  [87] mvtnorm_1.1-2                          
##  [88] patchwork_1.0.0                        
##  [89] hms_1.0.0                              
##  [90] evaluate_0.14                          
##  [91] xtable_1.8-4                           
##  [92] XML_3.99-0.6                           
##  [93] gridExtra_2.3                          
##  [94] shape_1.4.6                            
##  [95] ggupset_0.3.0                          
##  [96] compiler_4.0.3                         
##  [97] biomaRt_2.46.3                         
##  [98] tibble_3.1.0                           
##  [99] KernSmooth_2.23-18                     
## [100] crayon_1.4.1                           
## [101] shadowtext_0.0.9                       
## [102] htmltools_0.5.2                        
## [103] mgcv_1.8-34                            
## [104] pcaPP_2.0-1                            
## [105] ggfun_0.0.4                            
## [106] rrcov_1.7-0                            
## [107] tidyr_1.1.3                            
## [108] geneplotter_1.68.0                     
## [109] DBI_1.1.1                              
## [110] tweenr_1.0.2                           
## [111] ChIPseeker_1.33.0.900                  
## [112] dbplyr_2.1.1                           
## [113] ComplexHeatmap_2.13.1                  
## [114] MASS_7.3-53.1                          
## [115] rappdirs_0.3.3                         
## [116] boot_1.3-28                            
## [117] babelgene_21.4                         
## [118] Matrix_1.3-3                           
## [119] cli_2.4.0                              
## [120] parallel_4.0.3                         
## [121] igraph_1.2.6                           
## [122] pkgconfig_2.0.3                        
## [123] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [124] rvcheck_0.1.8                          
## [125] GenomicAlignments_1.26.0               
## [126] xml2_1.3.2                             
## [127] foreach_1.5.1                          
## [128] PCAtools_2.2.0                         
## [129] annotate_1.68.0                        
## [130] bslib_0.3.1                            
## [131] dqrng_0.2.1                            
## [132] DEFormats_1.18.0                       
## [133] XVector_0.30.0                         
## [134] yulab.utils_0.0.4                      
## [135] stringr_1.4.0                          
## [136] digest_0.6.27                          
## [137] Biostrings_2.58.0                      
## [138] rmarkdown_2.14                         
## [139] fastmatch_1.1-3                        
## [140] DelayedMatrixStats_1.12.3              
## [141] curl_4.3                               
## [142] Rsamtools_2.6.0                        
## [143] gtools_3.8.2                           
## [144] NOISeq_2.34.0                          
## [145] rjson_0.2.20                           
## [146] nlme_3.1-152                           
## [147] lifecycle_1.0.0                        
## [148] jsonlite_1.7.2                         
## [149] viridisLite_0.4.0                      
## [150] askpass_1.1                            
## [151] fansi_0.4.2                            
## [152] pillar_1.5.1                           
## [153] lattice_0.20-45                        
## [154] DEoptimR_1.0-11                        
## [155] fastmap_1.1.0                          
## [156] httr_1.4.2                             
## [157] plotrix_3.8-2                          
## [158] survival_3.2-10                        
## [159] GO.db_3.12.1                           
## [160] glue_1.4.2                             
## [161] png_0.1-7                              
## [162] iterators_1.0.13                       
## [163] plot3D_1.4                             
## [164] bit_4.0.4                              
## [165] ggforce_0.3.3                          
## [166] stringi_1.5.3                          
## [167] sass_0.4.1                             
## [168] blob_1.2.1                             
## [169] textshaping_0.1.2                      
## [170] BiocSingular_1.6.0                     
## [171] caTools_1.18.2                         
## [172] memoise_2.0.0                          
## [173] dplyr_1.0.5                            
## [174] irlba_2.3.5